Abstract

Learning object models from views in 3D visual ob- 
ject recognition is usually formulated either as a func- 
tion approximation problem of a function describing 
the view-manifold of an object, or as that of learn- 
ing a class-conditional density. This paper describes 
an alternative framework for learning in visual object 
recognition, that of learning the view-generalization 
function. Using the view-generalization function, an 
observer can perform Bayes-optimal 3D object recog- 
nition given one or more 2D training views directly, 
without the need for a separate model acquisition 
step. The paper shows that view generalization func- 
tions can be computationally practical by restating 
two widely-used methods, the eigenspace and linear 
combination of views approaches, in a view general- 
ization framework. The paper relates the approach to 
recent methods for object recognition based on non- 
uniform blurring. The paper presents results both on 
simulated 3D "paperclip" objects and real-world im- 
ages from the COIL-100 database showing that useful 
view-generalization functions can realistically be
learned from a comparatively small number of
training examples.



1 Introduction 

Learning view-based or appearance-based models of
objects has been a major area of research in visual
object recognition (see [5] for reviews). One direction
of research has focused on treating the problem of
learning appearance-based models as an interpolation
problem [16, 14]. Another approach is to treat the
problem of learning object models as a classification
problem.

*This paper was originally written in November 2003, but
was submitted to arXiv in 2007. References have not been
updated to include more recent work.

Both approaches have some limitations. For ex- 
ample, acquiring a novel object may involve fairly 
complex computations or model building. They also 
do not easily explain how an observer can transfer 
his skill at recognizing existing objects to generaliz- 
ing from single or multiple views of novel objects; to 
explain such transfer, a variety of additional meth- 
ods have been explored in the literature, including 
the use of object classes or categories, the acquisi- 
tion and use of object parts, or the adaptation and 
sharing of features or feature hierarchies. 

This paper describes an approach to learning 
appearance-based models that addresses these issues 
in a unified framework: the visual learning problem 
is reformulated as that of learning view generalization
functions. The paper shows that knowledge of
the view generalization function is equivalent to being
able to carry out Bayes-optimal 3D object
recognition for an arbitrary collection of objects,
presented to the system as training views. Model
acquisition reduces to storing 2D views and does not
involve learning or model building.

This represents a significant paradigm shift relative
to previous approaches to learning in visual object
recognition, which have treated the problem of
acquiring models as a separate learning problem.
While previous models of visual object recognition
can be reinterpreted in the framework in this paper
(and we will do so for two such methods), the
formulation in terms of view generalization functions
makes it easy to apply any of a wide variety of
standard statistical models and classifiers to the problem
of generalization to novel objects.

In this paper, I will first express Bayes-optimal 3D
object recognition in terms of training and target
views and prior distributions on object models and
viewpoints. Then, I will describe the statistical basis
of learning view generalization functions. Finally, I
will demonstrate, both on the standard "paperclip"
model and on the COIL-100 database, that learning
view generalization functions is feasible.



2 Bayesian 3D Object Recognition

This section will review 3D object recognition from a
Bayesian perspective and establish notation. Let us
look at the question of how an observer can recognize
3D objects from their 2D views. Let $\omega$ identify an
object and $B$ be an unknown 2D view (we will also refer
to $B$ as the target view). Then, classifying $B$
according to $\hat\omega(B) = \arg\max_\omega P(\omega|B)$ is well known
to result in minimum error classification [?]. Using
Bayes rule, we can rewrite this as

$$\arg\max_\omega P(\omega|B) = \arg\max_\omega \frac{P(B|\omega)\,P(\omega)}{P(B)} = \arg\max_\omega P(B|\omega)\,P(\omega) \qquad (1)$$



$P(\omega)$ is simply the frequency with which object $\omega$
occurs in the world. Let us try to express $P(B|\omega)$ in
terms of models and/or training views.

Assume that we are given a 3D object model
$M_\omega$. In the absence of noise, the projection of
this 3D model into a 2D image is determined by
some function $f$ of the viewing parameters $\phi \in \Phi$,
$B = f(M_\omega, \phi)$. The function $f$ usually consists of a
rigid body transformation followed by orthographic or
perspective projection.

In the presence of additive noise, $B = f(M_\omega, \phi) + N$
for some amount of noise distributed according
to some prior noise distribution $P(N)$. With this
notation, we can now express $P(B|\omega)$ in terms of the
3D object model:¹

$$P(B|\omega) = \int \delta(B, f(M_\omega, \phi) + N)\, P(\phi)\, P(N)\, d\phi\, dN \qquad (2)$$

To simplify notation below, we write
$P(B|M_\omega, \phi) = \int \delta(B, f(M_\omega, \phi) + N)\, P(N)\, dN$ and obtain

$$P(B|\omega) = \int P(B|M_\omega, \phi)\, P(\phi)\, d\phi \qquad (3)$$

Figure 1: Examples of paperclips used in the simulations.



By construction, Equation 3 represents Bayes-optimal
3D model-based recognition, assuming perfect
knowledge of the 3D model $M_\omega$ for a given object $\omega$.

In real-world recognition problems, the observer is
rarely given a correct 3D model $M_\omega$ prior to recognition.
Instead, the observer needs to infer the model
from a set of training views² $\mathcal{T}_\omega = \{T_{\omega,1}, \ldots, T_{\omega,r}\}$.
Therefore, an observer is faced with the problem of
determining $P(B|\omega)$ as $P(B|\mathcal{T}_\omega)$. In a model-based
framework, this means that the observer attempts
to reconstruct the object model $M$
given the training views $\mathcal{T}_\omega$ and then performs
recognition using the resulting distribution of
probabilities over the possible models. If we
put this together with Equation 3, we obtain for
$P(B|\omega) = P(B|\mathcal{T}_\omega)$:

$$P(B|\mathcal{T}_\omega) = \int P(B|M, \phi)\, P(M|\mathcal{T}_\omega)\, P(\phi)\, dM\, d\phi \qquad (4)$$



By construction, $P(B|\mathcal{T}_\omega)$ represents the density of
target views $B$ given a set of training views $\mathcal{T}_\omega$.
Therefore, applying Equation 4 together with Equation 1
results in Bayes-optimal 3D model-based recognition
from 2D training views.

¹ $\delta$ is the Dirac delta function.

² For the rest of the paper, we limit ourselves to the case
where the training and test views are drawn in an identical
manner and independently of one another; the more general
case in which, say, the training views $\mathcal{T}_\omega$ come from a
motion sequence and hence have sequential correlations in their
viewing parameters can be treated analogously.

Figure 2: Illustration of $P(B|T_\omega)$. (a) The feature
vector $T_\omega$, represented as an image (vertices of the
clip quantized to a grid). (b) $\log P(B|T_\omega) - \log P(B)$
(darker = higher probability).

Now that we have derived Bayes-optimal 3D
object recognition, let us look at some approaches
that have been proposed in the literature for solving
the 3D object recognition problem and how they
relate to Bayes-optimal recognition.

3D Model-Based Maximum Likelihood Meth- 
ods. Traditional approaches to model-based 3D 
computer vision (e.g., [6]) generally divide recogni- 
tion into two phases. During a model acquisition 
phase, the recognition system attempts to optimally 
reconstruct 3D models from 2D training data. Dur- 
ing the recognition phase, the system attempts to 
find the optimal match of the reconstructed 3D model 
against image data. 

This is often realized by estimating $M_\omega$ using
a maximum likelihood or maximum a posteriori
(MAP) procedure (e.g., least squares methods,
assuming Gaussian error), $\hat{M}_\omega = \arg\max_M P(M|\mathcal{T}_\omega)$,
and then performing 3D model-based recognition in
a maximum likelihood setting using $\hat{M}_\omega$:

$$P(B|\omega) = P(B|\mathcal{T}_\omega) = \max_\phi P(B|\hat{M}_\omega, \phi) \qquad (5)$$

$$\hat{M}_\omega = \arg\max_M P(M|\mathcal{T}_\omega) \qquad (6)$$

It is important to remember that this approach is
not Bayes-optimal in general; it is a good approximation
only under certain conditions, for example, when
all the distributions $P(B|M, \phi)$ are unimodal, sharply
peaked, and have comparable covariances. Furthermore,
computationally, the maximum likelihood estimations
have proven to be fairly difficult and costly
optimization problems.

One reason that has made such approaches attractive
is that, as the amount of noise and variability
becomes small, the reconstruction and matching problems
can be treated geometrically, and a wealth of
results has been derived in that limit (cf. algorithms
like [5]). But from a statistical point of view, such
geometric approaches can be unnecessarily restrictive.
For example, in the case in which the training set $\mathcal{T}_\omega$
consists of only a single view $T_\omega$, 3D reconstruction
is not possible for arbitrary 3D objects. Yet, as we
will see in the experimental results below, $P(M|\mathcal{T}_\omega)$
still contains considerable amounts of information.

View Interpolation Approaches. Because the
imaging transformation $f(M, \phi)$ is smooth, the set
of views $B_M = \{f(M, \phi) \mid \phi \in \Phi\}$ of an object
itself forms a smooth, low-dimensional surface in the
space of all possible views. In fact, $B_M$ is embedded
in a low-dimensional linear subspace of the space of
all possible views [TB]. The smoothness of $B_M$ suggests
that it might be learned from examples using
a surface or function interpolation method. This has
given rise to one of the most influential approaches
to learning in 3D object recognition, developed by
Poggio and Edelman [T3].

Methods that approximate the view manifold (e.g., 
[THESE]) generally attempt to compute some geo- 
metrically motivated distance of the target view from 
the view manifold and then perform nearest neighbor 
classification in terms of that distance. This approach 
would minimize recognition error rates if the distri- 
bution of views over the view manifolds were uniform 
and several other conditions were satisfied. However, 
most work on geometric and interpolation methods 
does not demonstrate Bayes-optimality of the classi- 
fication error, but only proves results about the qual- 
ity of the approximation to the view manifold that 
they achieve. In general, a good approximation to 
the view manifolds is neither necessary nor sufficient 
for Bayes-optimal recognition (although it does often 
seem to work reasonably well). 
Classification Approaches. Many classification
methods (multilayer perceptrons, logistic regression,
mixture discriminant analysis, etc.) are concerned
with estimating posterior distributions like $P(\omega|B)$ or
corresponding discriminant functions directly. They
share with the methods described in this paper that
they do not necessarily involve the two-step maximization
procedure used in traditional model-based
systems (Equations 5 and 6). Classification methods
have not been all that popular for 3D object recognition
in the past, but there has been some recent work
on them (e.g., [T5]).

Single-View Generalization. Based on geometric
considerations alone, if nothing else is known
about a 3D object, multiple views of an object are
needed in order to reconstruct a 3D model of the
object from views (e.g., [5]). Generalization from a
single view is usually only considered possible when
the object is known to have special properties like
symmetry or when the object is known to be a member
of some other kind of object class (e.g., [Tf]).
Geometrically, of course, this is true. Statistically,
however, even if 3D model reconstruction is not possible,
$P(B|\mathcal{T}_\omega)$ may still contain information permitting
significant single-view generalization, as the
experiments below will show.

3 View Generalization Functions

We have seen that previous approaches to learning
object models have concentrated on learning $f_M(\omega)$,
$P(\omega|B)$, or $P(B|\omega)$. This paper proposes and examines
a different learning problem for 3D object recognition:
the direct estimation of the view generalization
function, defined as follows:

Definition 1 We define the r-view generalization
function as the conditional density
$P(B|\mathcal{T}_\omega) = P(B|T_{\omega,1}, \ldots, T_{\omega,r})$ given by Equation 4.

If the training set $\mathcal{T}_\omega$ consists of a single view $T_\omega$,
we call this a single view generalization function.
Notice that view generalization functions are functions
of views only; they do not involve any object models.
In some sense, they tell us how similar an unknown
view is to a set of training views.

If we have a good estimate of the view general- 
ization function, we can perform Bayes-optimal 3D 
object recognition by a generalized nearest neighbor 
procedure with a variable metric, somewhat analo- 
gous to the procedure in [3] . 

That is, the vision system initially builds a good
approximation of the view generalization function
$P(B|\mathcal{T}_\omega)$ from visual input. This might require a
lot of training data, corresponding perhaps to several
years of visual input after birth in human vision.

Once a vision system has acquired a fairly good
approximation of $P(B|\mathcal{T}_\omega)$, the acquisition of new
object models merely requires storing the training
views $\mathcal{T}_\omega$. Let us assume that training views are
unambiguous, $P(\omega|T_\omega) = 1$ (otherwise, the procedure
is still optimal $k$-nearest neighbor but does not
necessarily achieve Bayes-optimal classification rates
[3]). Given the view generalization function and a
collection of training views for each object, Bayes-optimal
recognition of an unknown view $B$ against
the model base can then be carried out by evaluating
$P(B|\mathcal{T}_{\omega_i})\, P(\omega_i)$ for each object $\omega_i$ under consideration
and classifying according to Equation 1. Furthermore,
if the view generalization function $P(B|\mathcal{T}_\omega)$ can
be implemented in a low-depth circuit, the visual system
will be able to carry out Bayes-optimal recognition
of novel 3D objects from 2D training views
quickly, without the need for the optimizations implicit
in traditional maximum likelihood approaches
used in computer vision (see Equations 5 and 6).
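The recognition procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: `gaussian_g` is a hypothetical placeholder for a learned view generalization function (it is not the function estimated in this paper), the model base holds a single training view per object, and the object names and feature vectors are made up.

```python
import math

def gaussian_g(b, t, sigma=0.5):
    """Hypothetical placeholder for g(B, T) = P(B | T): an unnormalized
    isotropic Gaussian around the stored view. A learned view
    generalization function would take its place."""
    d2 = sum((bi - ti) ** 2 for bi, ti in zip(b, t))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def recognize(b, model_base, priors, g=gaussian_g):
    """Return the object label maximizing g(B, T_omega) * P(omega)."""
    return max(model_base, key=lambda w: g(b, model_base[w]) * priors[w])

# Model base: one stored 2D training view (as a feature vector) per object.
model_base = {"clip_a": [0.0, 0.0, 1.0], "clip_b": [1.0, 1.0, 0.0]}
# Acquiring a new object only stores its view; g itself is left unchanged.
model_base["clip_c"] = [0.0, 1.0, 1.0]
priors = {w: 1.0 / len(model_base) for w in model_base}

result = recognize([0.1, 0.9, 1.0], model_base, priors)
print(result)
```

The point of the sketch is the division of labor: all learning lives in `g`, while adding an object is a dictionary insertion.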

Of course, whether this approach works hinges crucially
on whether it is possible to learn an approximation
to the view generalization function that actually
generalizes to novel objects and has the desired
properties. If every new object the system encounters
requires updating of the estimate of the view generalization
function, then the approach effectively reduces
to traditional one-by-one learning of object models.
If, on the other hand, after an initial set of training
examples, the estimate of $P(B|\mathcal{T}_\omega)$ generalizes
reasonably well to previously unseen objects, then the
approach is successful.
The rest of this section will explore these issues further
with examples and some theoretical arguments.
Subsequent sections will provide some experimental
evidence that learning view generalization functions
is feasible.

Smoothness of the View Generalization Function.
Intuitively, we would expect that, for most
objects and views, if the sets of training views for
two objects are similar, the distributions $P(M|\mathcal{T}_\omega)$ of
possible corresponding object models are similar as
well, and so are the distributions $P(B|M)$ of other
possible views. This corresponds to a statement
about the smoothness of the view generalization function.
It can be demonstrated formally for specific
model distributions, camera and noise models by
differentiating Equation 4 with respect to $B$ and the
$T_{\omega,i}$.

Such smoothness properties suggest that the view
generalization function may be learnable using techniques
like radial basis function (RBF) interpolation
or multilayer perceptrons (MLPs) that take advantage
of smoothness; [14] use a similar argument to
motivate the use of RBFs for learning individual view
manifolds.

Note that, in contrast to the view generalization
function, the maximum likelihood solutions given by
Equations 5 and 6 and used in many computer vision
systems, when viewed as functions of the target and
training views, are not necessarily smooth and therefore
probably not easily approximated using models
like RBFs.

Model Priors. One of the important properties of 
the view generalization function is that it does not 
depend on the specific models the observer has ac- 
quired in his model base. Rather, it depends on the 
prior distribution of models from which the actual 
models encountered by the system are drawn. 

Theorem 1 The view generalization function is
completely determined by the prior distribution of 3D
models $P(M)$, the distribution of viewing parameters
$P(\phi)$, the noise distribution $P(N)$, and the choice of
imaging model $f(M, \phi)$.

Proof. In analogy to Equation 2, we have for
a training view $T_\omega$,
$P(T_\omega|M) = \int \delta(T_\omega, f(M, \phi) + N)\, P(\phi)\, P(N)\, d\phi\, dN$.
Since the training views are
(by assumption) drawn independently,
$P(\mathcal{T}_\omega|M) = \prod_{T \in \mathcal{T}_\omega} P(T|M)$. Using Bayes formula, we invert
this to yield $P(M|\mathcal{T}_\omega)$. Furthermore,
$P(B|M, \phi) = \int \delta(B, f(M, \phi) + N)\, P(N)\, dN$. With this, we have
all the components to evaluate Equation 4. □

Linear Combination of Views. Let us now
turn to the question of whether fast, or even low-depth
arithmetic circuit, implementations of view
generalization functions are plausible. To do this,
we will recast two commonly used approaches to
3D object recognition, linear combination of views
[16] and eigenspace methods (below), into a view
generalization function form. The resulting view
generalization functions implement those models exactly
and hence would perform identically to those methods
if implemented.

In a linear combination of views framework, we test
whether a novel target view $B$ can be expressed as
a linear combination of training views. Let us assume
concretely that we want to generalize based
on three training views per object,
$P(B|T_1, T_2, T_3) = g(B, T_1, T_2, T_3)$. The error $e$ by which we judge
similarity is the magnitude of the residual that remains
after the linear combination of training views has
been subtracted. Performing nearest neighbor classification
using $e$ corresponds to assuming any of a wide
number of unimodal, symmetric distributions $U$ for
$e$; that is, nearest neighbor classification using linear
combination of views is the same as classifying using
the conditional density $P(B|T_1, T_2, T_3) = U(e)$. If we
write $p_v(x) = x - \frac{x \cdot v}{v \cdot v}\, v$ for the residual that remains
after subtracting the projection of $x$ onto $v$ from $x$,
then we can compute $e$ as $e = \|p_{T_3}(p_{T_2}(p_{T_1}(B)))\|$,
and the linear combination of views (LCV) view generalization
function is $g_{LCV}(B, T_1, T_2, T_3) = U(e) = U(\|p_{T_3}(p_{T_2}(p_{T_1}(B)))\|)$.
Generalizing to $r$ training
views, we can clearly compute this with an arithmetic
circuit of depth proportional to $r$. Therefore,
we have seen that if we use a linear combination of
views model of object similarity, then the view generalization
function can be expressed as a fairly simple
function that can be implemented as a circuit of
depth proportional to the number of views $r$.
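As a rough illustration of the residual computation, the following sketch evaluates $e$ and $U(e)$ for toy vectors. It makes two assumptions that are not in the text: the training views are orthogonalized first (Gram-Schmidt), which is what makes the chain of successive residuals equal to the least-squares residual with respect to the span of the views, and $U$ is taken to be Gaussian-shaped. The function names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(x, v):
    """p_v(x): what remains of x after removing its projection onto v."""
    vv = dot(v, v)
    if vv == 0.0:
        return list(x)
    c = dot(x, v) / vv
    return [xi - c * vi for xi, vi in zip(x, v)]

def lcv_error(b, training_views):
    """Norm of B after successively removing each training direction.

    Assumption added here: the views are orthogonalized (Gram-Schmidt)
    before the residual chain is applied to B.
    """
    basis = []
    for t in training_views:
        for q in basis:
            t = residual(t, q)
        basis.append(t)
    r = list(b)
    for q in basis:  # one residual step per training view: depth grows with r
        r = residual(r, q)
    return math.sqrt(dot(r, r))

def g_lcv(b, training_views, sigma=1.0):
    """U(e) with a Gaussian-shaped U (an assumed choice of unimodal U)."""
    e = lcv_error(b, training_views)
    return math.exp(-e * e / (2.0 * sigma ** 2))

views = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
inside = [2.0, 2.0, 1.0, 0.0]    # T1 + T2 + T3, so the residual vanishes
outside = [0.0, 0.0, 0.0, 2.0]   # orthogonal to all three training views
print(lcv_error(inside, views), lcv_error(outside, views))
```

In a stored model base, the orthogonalization would only need to be done once per object, when its views are stored.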

Eigenspace Methods. Eigenspace methods and
related techniques have been used extensively in
information retrieval (latent semantic analysis, LSA)
and computer vision [13, 12]. In general, in
eigenspace methods, given a set of training views $T_i$
for multiple objects, we compute a low-dimensional
linear subspace $S$ and evaluate similarity among a
target view $B$ and a training view $T_\omega$ within that
low-dimensional subspace. That is, eigenspace methods
use an error $e = \|P_S(B) - P_S(T_\omega)\|$ for nearest
neighbor classification, where $P_S$ is the linear
projection operator onto $S$. This procedure can be
justified, for example, when the training samples $T_i$
fall into a low-dimensional linear subspace in the
error-free case, but are corrupted with Gaussian noise
whose magnitude is small compared to the variability
of the training samples. Then, if we determine the
covariance matrix of the $T_i$, its large eigenvalues will
correspond approximately to directions representing
meaningful object variability, while its small eigenvalues
will correspond approximately to directions
representing only noise [1].

As before, nearest neighbor classification using $e$ is
equivalent to choosing some unimodal error distribution
$U(e)$ (e.g., Gaussian) and approximating

$$P(B|\mathcal{T}_\omega) \propto \max_{T \in \mathcal{T}_\omega} U(e) = \max_{T \in \mathcal{T}_\omega} U(\|P_S(B) - P_S(T)\|) \qquad (7)$$

Therefore, we can view eigenspace methods as a very
simple form of learning a view generalization function;
the function has the specific form given in Equation 7,
with only the projection operator $P_S$ being
learned by the observer.
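A minimal sketch of this idea follows, assuming the subspace is taken to be the span of the leading principal directions of the training samples. The power iteration with deflation is just one simple, dependency-free way to obtain those directions and is not claimed to be the procedure used in [13, 12]; the toy data and names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def learn_subspace(samples, k, iters=100):
    """Top-k principal directions by power iteration with deflation."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    basis = []
    for c in range(k):
        v = [1.0 / (j + 1 + c) for j in range(d)]  # fixed, generic start vector
        for _ in range(iters):
            for q in basis:  # stay orthogonal to directions found so far
                cq = dot(v, q)
                v = [vi - cq * qi for vi, qi in zip(v, q)]
            w = [0.0] * d  # w = (sum_x x x^T) v: unnormalized covariance times v
            for x in centered:
                a = dot(x, v)
                for j in range(d):
                    w[j] += a * x[j]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [wj / norm for wj in w]
        basis.append(v)
    return basis

def eigenspace_error(b, t, basis):
    """e = ||P_S(B) - P_S(T)||, computed in subspace coordinates."""
    return math.sqrt(sum((dot(b, q) - dot(t, q)) ** 2 for q in basis))

# Toy data: large variability along axes 0 and 1, tiny "noise" along axis 2.
samples = [[1, 0, 0, 0], [-1, 0, 0, 0], [0, 2, 0, 0], [0, -2, 0, 0],
           [0, 0, 0.01, 0], [0, 0, -0.01, 0]]
basis = learn_subspace(samples, k=2)
e_same = eigenspace_error([1, 2, 5, 7], [1, 2, -3, 0], basis)  # differ only outside S
e_diff = eigenspace_error([0, 0, 0, 0], [3, 4, 0, 0], basis)   # differ inside S
print(e_same, e_diff)
```

Differences outside the learned subspace are ignored by construction, which is exactly the sense in which only the projection operator is learned.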

4 First Order Single View Model

In this section, we will look at a simple experimental
evaluation of single view generalization functions,
applied to simulated 3D paperclips. Simulated 3D
paperclips are widely used in computational vision,
psychophysical experiments, and neurophysiological
work (e.g., [HE]). Let us briefly review the model
here and state the parameters used in this and the
next section.

Random 3D models are generated by picking five
unit vectors in $\mathbb{R}^3$ with uniformly random directions
and putting them end-to-end. To obtain a 2D view of
the object, the 3D model is rotated by some amount 
and then projected orthographically along the z axis. 
Views are centered so that the centroid falls at the 
origin. 
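The generation procedure can be sketched as follows. Several details are assumptions of this sketch rather than statements of the paper: uniform directions are drawn by normalizing Gaussian samples, the rotation is parameterized as a rotation about the x axis followed by one about the y axis, and the centroid is taken over the vertices.

```python
import math
import random

def random_unit_vector(rng):
    """Uniformly random direction, obtained by normalizing a Gaussian sample."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(c * c for c in v))
        if n > 1e-12:
            return [c / n for c in v]

def random_clip(rng, segments=5):
    """Vertices of a clip: unit-length segments placed end to end."""
    pts = [[0.0, 0.0, 0.0]]
    for _ in range(segments):
        d = random_unit_vector(rng)
        pts.append([p + q for p, q in zip(pts[-1], d)])
    return pts

def rotate_xy(p, ax, ay):
    """Rotate p by ax about the x axis, then by ay about the y axis (radians)."""
    x, y, z = p
    y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
    x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
    return [x, y, z]

def view(clip, ax, ay):
    """Orthographic projection along z, centered on the vertex centroid."""
    pts = [rotate_xy(p, ax, ay)[:2] for p in clip]
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return [[p[0] - cx, p[1] - cy] for p in pts]

rng = random.Random(0)
clip = random_clip(rng)
v = view(clip, math.radians(20.0), math.radians(-30.0))
print(len(clip), len(v))
```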

For all the experiments involving paperclips below, 
the training set consisted of random views derived 
from a fixed set of 200 randomly constructed 3D clip 
models. That is, all generalization to arbitrary, pre- 
viously unseen 3D clip models was derived from infor- 
mation learned from this small, fixed sample of 200 
clips. 

For each test trial, novel, previously unseen 3D clip
models were generated randomly, and random views
of those clips were generated by random rotations
in the range [−40°, +40°] around the x and y axes
relative to the training view; this range of rotations
was chosen because it is comparable to what previous
authors have used and seems to be at the limit
of human single view generalization ability for these
kinds of images (e.g., [14]).

In order to be accessible to a learning algorithm,
these views need to be encoded as a feature vector.
Three kinds of encodings have been commonly used
in the literature and are used in this paper. An angular
encoding uses the ordered sequence of angles
around each vertex in the projected image, giving
rise to a four-dimensional feature vector. An ordered
location encoding uses the concatenation of x and
y coordinates, in sequence, as its feature vector,
resulting in a 10-dimensional feature vector. A feature
map encoding projects the vertices of the clip onto a
bounded grid composed of 40 x 40 buckets, resulting
in a binary feature vector of length 1600.
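Two of these encodings can be sketched directly from a centered 2D view given as a list of vertex coordinates. The details are assumptions of this sketch: the angle at each interior vertex is measured as a signed angle between the two adjacent segments, and the grid covers the square of half-width 2.5 around the origin, with vertices on the border clamped to the outermost bucket.

```python
import math

def angle_encoding(points):
    """Signed angle between the two segments meeting at each interior vertex."""
    angles = []
    for i in range(1, len(points) - 1):
        ax, ay = points[i - 1][0] - points[i][0], points[i - 1][1] - points[i][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles

def feature_map(points, n=40, half_width=2.5):
    """Binary n*n vector with a 1 in every bucket that contains a vertex."""
    grid = [0] * (n * n)
    for x, y in points:
        i = min(n - 1, max(0, int((x + half_width) / (2.0 * half_width) * n)))
        j = min(n - 1, max(0, int((y + half_width) / (2.0 * half_width) * n)))
        grid[j * n + i] = 1
    return grid

# A centered "staircase" view with six vertices, used only as toy input.
pts = [[-1.5, -1.0], [-0.5, -1.0], [-0.5, 0.0], [0.5, 0.0], [0.5, 1.0], [1.5, 1.0]]
angles = angle_encoding(pts)
fmap = feature_map(pts)
print(len(angles), len(fmap), sum(fmap))
```

With six vertices, the angular encoding has four entries and the feature map has 1600, matching the dimensions stated above.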

Single View Generalization. Let us now look at
building an empirical distribution model of $P(B|\mathcal{T}_\omega)$.
We will limit ourselves to single-view generalization
models; that is, we assume that the set of training
views for an object $\omega$ consists of a single view,
$\mathcal{T}_\omega = \{T_\omega\}$. Note that this problem has not been
studied much in computer vision; this is perhaps because,
based on geometry alone, a training set consisting
of a single view $T_\omega$ does not permit reconstruction
of the 3D structure of an arbitrary object even
in the error-free case. However, as several authors
have observed (e.g., [14]), human observers are capable
of a significant degree of 3D generalization, so
there is reason to believe that 3D recognition based
on $P(B|T_\omega)$, that is, recognition based solely on a
single training view, is possible, at least to some degree.

First Order Approximation. For concreteness,
let us assume the feature map representation of views
discussed above. In that representation, a view $B$ is a
binary feature vector $B = (B_1, \ldots, B_r)$, where each
$B_i$ represents a pixel or bucket in the image, and
analogously for $T$. We can try to model $P(B|T)$ as
an expansion [TU]:

$$\log P(B|T) \approx -\Big(h^{(0)} + \sum_{ij} h^{(1)}_{ij}(B_i, T_j) + \sum_{ijk} h^{(2)}_{ijk}(\ldots) + \cdots\Big) \qquad (8)$$

Here, the $h^{(k)}$ are functions of their boolean-valued
arguments. The different $h^{(k)}$ correspond to taking into
account increasingly higher-order correlations among
features.

Of particular interest is the "first-order" approximation,
for which we take into account only $h^{(0)}$ and $h^{(1)}$.
Let us look at the probability that pixel $B_i$ in
the view $B$ is "on" given the training view $T$:

$$\log P(B_i = 1|T) \propto \mathrm{const} - \sum_j h_{ij}\, T_j$$

But this means that if we look at $\log P(B_i|T)$, it is a
blurred version of the training view, with $h_{ij}$ as
a spatially varying blurring kernel.

Blurring, with or without spatially variable kernels,
has been proposed as a means of generalization
in computer vision by a number of previous authors.
In a recent result, [2] derives non-uniform
blurring for 2D geometric matching problems, the
"geometric blur" of an object. The results sketched
in this section make the connection between non-uniform
geometric blurring and first order approximations
to the single view generalization function,
$g(B, T) = P(B|T)$. This connection lets us determine
more precisely how we should compute geometric
blurring, what approximations it involves compared
to the Bayes-optimal solution, and how we can
improve those approximations to higher-order statistical
models. Let us note also that there is nothing
special about the representation in terms of feature
maps; had we chosen to represent views as collections
of feature coordinates, a first order approximation
would have turned into error distributions on the
location of each model feature.

Experimental Results. Using the paperclip models,
we can estimate the parameters of the first order
model above by simulation: we repeatedly generate
different views of objects, compute their feature vectors,
and compute the frequency of co-occurrence of
features in the target and training views (a
kind of Hebbian learning). This allows us to visualize
the non-linear blurring that results in single-view
generalization. An example of this is shown in Figure 2.

Note that, similar to [2], there is more blurring
further away from the center of the object. However,
the two approaches differ in that geometric blur
does not take into account, among other things, the
prior distribution of models $P(M)$ and hence does not
necessarily result in Bayes-optimal performance when
applied to object recognition problems, while the empirical
statistical model of view similarity used here
approximates the true class conditional distribution.

In terms of error rates in a forced-choice experiment,
view similarity using these non-uniform blurs
achieves an error rate of 7.2%, compared to 32% using
simple 2D similarity, demonstrating substantial
improvements from the use of the view similarity approach.
Note also that because of the nature of the
feature vector used (a 2D feature map), the system did
not have access to correspondence information.
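The co-occurrence estimate behind such a first-order model can be illustrated on a toy problem. This is a sketch under assumptions, not the estimator used for the results above: the pairwise terms are parameterized as $-\log P(B_i = 1 | T_j = 1)$ for pixels that are "on" in both views and zero otherwise, the counts are Laplace-smoothed with a hypothetical parameter `eps`, and the data are a four-pixel toy example rather than 40 x 40 feature maps.

```python
import math

def cooccurrence_kernel(pairs, n, eps=1.0):
    """k[i][j] estimates P(B_i = 1 | T_j = 1) from (B, T) pairs of binary
    vectors taken from the same object; eps is Laplace smoothing."""
    on_t = [0] * n
    both = [[0] * n for _ in range(n)]
    for b, t in pairs:
        for j in range(n):
            if t[j]:
                on_t[j] += 1
                for i in range(n):
                    if b[i]:
                        both[i][j] += 1
    return [[(both[i][j] + eps) / (on_t[j] + 2.0 * eps) for j in range(n)]
            for i in range(n)]

def log_blur(kernel, t):
    """For each pixel i, sum log k[i][j] over the 'on' pixels j of T:
    a spatially varying blur of the training view in the log domain."""
    n = len(t)
    return [sum(math.log(kernel[i][j]) for j in range(n) if t[j])
            for i in range(n)]

def first_order_score(kernel, b, t):
    """First-order score of a target view B against a training view T."""
    blur = log_blur(kernel, t)
    return sum(blur[i] for i in range(len(b)) if b[i])

# Toy world: the target view is always the training view shifted by one pixel.
pairs = [([0, 1, 0, 0], [1, 0, 0, 0]),
         ([0, 0, 1, 0], [0, 1, 0, 0]),
         ([0, 0, 0, 1], [0, 0, 1, 0])] * 10
kernel = cooccurrence_kernel(pairs, 4)
shifted = first_order_score(kernel, [0, 1, 0, 0], [1, 0, 0, 0])
unshifted = first_order_score(kernel, [1, 0, 0, 0], [1, 0, 0, 0])
print(shifted, unshifted)
```

In this toy world the learned kernel prefers the shifted view over the pixel-identical one, which is the kind of regularity a fixed, uniform blur could not capture.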
5 View Similarity Models

Densities like the view generalization function
$P(B|\mathcal{T}_\omega)$ can be difficult to estimate. It would be
more convenient if we could reformulate the learning
problem as that of modeling a class posterior density:
there is a wide variety of models available for
class posterior densities (logistic regression, radial basis
functions, multilayer perceptrons, etc.).

Fortunately, we can perform that transformation
fairly easily. During recognition from a model base,
we compare the unknown view $B$ repeatedly against
collections of training views $\mathcal{T}_\omega$ for each object. There
are two conditions under which this takes place: either
the view $B$ derives from the same object $\omega$ as
the training views $\mathcal{T}_\omega$, or the view derives from some
other object. Let us represent these two conditions
by a boolean indicator variable $S$. For $B$ not derived
from $\omega$, the conditional distribution $P(B|S = 0, \mathcal{T}_\omega)$
is simply the prior distribution of possible views
$P(B)$. When $B$ is derived from the same object as
the training views, that is $S = 1$, we have:

$$P(B|S = 1, \mathcal{T}_\omega) = P(S = 1|B, \mathcal{T}_\omega)\, \frac{P(B)}{P(S = 1|\mathcal{T}_\omega)}$$

Given an unknown view $B$ to recognize, $P(B)$ does
not change with $\omega$, and $P(S = 1|\mathcal{T}_\omega) = P(\omega)$. Therefore,

$$\hat\omega = \arg\max_\omega P(B|\mathcal{T}_\omega)\, P(\omega) = \arg\max_\omega P(S = 1|B, \mathcal{T}_\omega)$$

Let us call the distribution $P(S = 1|B, \mathcal{T}_\omega)$ the view
similarity function. If $\mathcal{T}_\omega$ consists of a single view, we
call this distribution the single view similarity function.
It acts like an adaptive similarity metric [9]
when used for recognition from a model base using
Equation 1.

Figure 3: (a) Sample images from the COIL-100
database. (b) The feature map used as input to the
recognition system.

Experiments. Let us look now at how view similarity
functions can be learned in the case of 3D
paperclips. As in the previous section, we consider
the single view generalization problem and apply it to
the problem of paperclip recognition. During a training
phase, the experiments used a collection of 200
paperclips, generated according to the procedure described
in the previous section. The procedure used
for generating the paperclips implies the prior distribution
$P(B) = P(T_\omega)$, and the training set is a
sample from this distribution. For training, the system
chooses one of those paperclips $\omega$ at random and
generates two different views, a training view $T_\omega$ and
a target view $B$. Then, it picks a second paperclip
$\omega' \neq \omega$ at random and generates a view $B'$. The pair
$(B, T_\omega)$ is then a training example for the condition
$S = 1$, and the pair $(B', T_\omega)$ is a training example for
the condition $S = 0$. Generating a number of these
pairs, we obtain a training set for a Bayesian classifier
$P(S|B, T)$.
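The pair-generation scheme can be sketched independently of the view model. In the sketch below, `sample_view` is a hypothetical callback that draws a random view of an object; for brevity the usage example replaces paperclips by scalar "objects" observed with Gaussian noise, which is an assumption made purely for illustration.

```python
import random

def make_pairs(objects, sample_view, n_pairs, rng):
    """Return (B, T, S) examples: S = 1 if B and T show the same object,
    S = 0 if B was generated from a different object."""
    examples = []
    for _ in range(n_pairs):
        same, other = rng.sample(range(len(objects)), 2)  # two distinct objects
        t = sample_view(objects[same], rng)        # training view T
        b = sample_view(objects[same], rng)        # target view B of the same object
        b_other = sample_view(objects[other], rng) # view B' of another object
        examples.append((b, t, 1))
        examples.append((b_other, t, 0))
    return examples

# Toy stand-in for clips and views: scalar objects observed with small noise.
objects = [0.0, 10.0, 20.0]

def noisy_view(obj, rng):
    return [obj + rng.gauss(0.0, 0.1)]

examples = make_pairs(objects, noisy_view, 50, random.Random(1))
print(len(examples), sum(s for _, _, s in examples))
```

Any classifier that outputs a posterior for $S$ given the concatenated pair $(B, T)$ could then be trained on `examples`.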

For testing, the experiment was carried out using
novel paperclips, that is, paperclips not found in the
training set of 200 paperclips. We could test by generating
a model base of some number of objects and
then performing nearest neighbor classification; we
will do that below on the COIL-100 database of real
images. However, that introduces another unnecessary
parameter into the evaluation, the size of the
model base. Therefore, here, we reduce the recognition
problem to a forced-choice experiment. In such
a forced-choice experiment, we generate test samples
analogous to training samples and measure the error
rate of the system on being able to distinguish $(B, T_\omega)$
from $(B', T_\omega)$. This is also a common paradigm used
in psychophysical experiments. An example of such
a forced-choice experiment can be seen in Figure 1:
the image at the left is the training view $T_\omega$, and the
two images on the right correspond to $B$ and $B'$ (not
necessarily in that order). Views were encoded using
the three feature types described in the previous
section; for location features, rotations were chosen
from {±45°}.
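The forced-choice error measure itself is simple to state in code. The sketch below assumes that a trial is a triple of a training view, a target view of the same object, and a view of a different object, that ties count as half an error, and that the baseline "2D similarity" is the negative squared Euclidean distance between feature vectors; these conventions and the toy trials are assumptions, not details taken from the experiments.

```python
def neg_sq_dist(b, t):
    """Baseline 2D similarity: negative squared Euclidean distance (assumed)."""
    return -sum((x - y) ** 2 for x, y in zip(b, t))

def forced_choice_error(trials, similarity=neg_sq_dist):
    """Fraction of trials (T, B_same, B_other) in which the view of the
    other object is judged at least as similar to T as the correct view."""
    errors = 0.0
    for t, b_same, b_other in trials:
        s_same, s_other = similarity(b_same, t), similarity(b_other, t)
        if s_other > s_same:
            errors += 1.0
        elif s_other == s_same:
            errors += 0.5  # a tie is counted as a coin flip
    return errors / len(trials)

trials = [([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]),   # correct view is closer
          ([1.0, 1.0], [0.9, 1.0], [1.0, 1.05])]  # wrong view is closer
rate = forced_choice_error(trials)
print(rate)
```

A learned view similarity function can be passed as `similarity` to compare it against the baseline on the same trials.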
2D Similarity



View Similarity 



Error Rate 



Angles Locations 



19.9% 



8.4% 



10.9% 



0.38% 



Feature Map 



32% 



7.9% 



These results show a substantial improvement of 
view-similarity functions over 2D similarity on single 
view generalization to novel objects. Note that many 
traditional recognition methods, like linear combi- 
nations of views or model-based recognition, cannot 
even be applied to this case because the observer is 
only given a single training view for each novel object. 

6 Experiments with COIL-100 

The experiments in the previous sections were all carried out on
simulated 3D paperclip objects, a widely used test case in the
literature. However, real-world images might show considerably
more variation and hence make the learning of view
generalization functions hard or impossible from reasonable
numbers of training images.

To test whether view similarity methods are ap- 
plicable to real images, experiments were carried out 
on the COIL-100 database [T^]. Furthermore, the 
eigenspace method used in [IT] was implemented as 
a control. 

The COIL-100 database contains color images rep- 
resenting views of objects separated by 5° rotation 
around the vertical axis. Even simple nearest neigh- 
bor classification methods perform nearly perfectly 
given that sampling and color input, so using the full 
database as training examples is not a very hard test 
of the ability to generalize to new views based on 
shape. 

To test for the ability to generalize to viewpoints that differ
substantially from the training view based on shape alone, the
database was preprocessed to remove color and absolute intensity
information, and only a coarser sampling of viewpoints was used.
Images were converted to grayscale and gradient features were
extracted, as shown in Figure 3. Training was carried out on
views from the first 70 objects in the database. The methods
were tested on views from the remaining 30 objects of the
database. For each test, only collections of views whose
viewpoints were spaced apart by multiples of 30° (12 per object)
were used.
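The preprocessing and data split can be illustrated with a short sketch. The details are assumptions: the text does not specify the grayscale conversion or the exact gradient features, so the code uses a channel average and a central-difference gradient magnitude, and it assumes the "first 70 objects" are object ids 1 to 70.

```python
import math

VIEW_STEP_DEG = 5      # angular spacing of the stored COIL-100 views
COARSE_STEP_DEG = 30   # spacing actually used in the experiments

# Viewpoints kept for the experiments: multiples of 30 degrees.
coarse_angles = [a for a in range(0, 360, VIEW_STEP_DEG)
                 if a % COARSE_STEP_DEG == 0]

# Assumed split by object id: first 70 for training, remaining 30 for testing.
train_objects = list(range(1, 71))
test_objects = list(range(71, 101))

def to_grayscale(rgb_image):
    """Average the color channels (one of several possible conversions)."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb_image]

def gradient_magnitude(gray):
    """Central-difference gradient magnitude, clamped at the borders.
    The result does not change when a constant is added to all pixels."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            out[y][x] = math.hypot(gx, gy)
    return out
```

Working on gradients rather than raw pixels removes additive intensity offsets, which is one simple way to discard absolute intensity information.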

The question addressed by these experiments on 
the COIL-100 database is whether it is possible to 
learn view generalization functions that are capable 
of any kind of generalization at all. Note that the 
view similarity model had no prior knowledge incor- 
porated into it at all, not even Euclidean distance. 
Without effective training, the view similarity func- 
tion performs at chance level, an error rate of 96.7%. 
Any performance better than that means that the 
view similarity model successfully generalized at least 
to some degree from the 70 training objects to the 30 
previously unseen test objects. Error rates for this 
recognition problem are shown in the following table 
(measured for 2160 test views): 





                      Error Rate
Euclidean Distance    40.0%
Eigenspace            26.1%
View Similarity       20.3%



As expected, the eigenspace method results in a strong
improvement over a Euclidean distance classifier. The view
similarity approach, using an MLP model of $P(S \mid B, T_\omega)$
with five hidden units, results in an additional decrease of the
error rate of nearly six percentage points, showing not only
that significant generalization has taken place between
different object models, but that even given a very small
training set of 70 objects, the method actually outperforms an
established approach to object recognition.³
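The recognition step and the chance level quoted above can be made concrete with a small sketch. It assumes a model base holding one stored training view per object and a generic `similarity(B, T)` score standing in for the learned $P(S = 1 \mid B, T_\omega)$; the scalar views in the usage lines are toy data, not COIL-100 images.

```python
def classify(view, model_base, similarity):
    """Return the id of the model whose stored view is most similar.
    model_base maps object id -> stored training view (assumed: one each)."""
    return max(model_base, key=lambda obj: similarity(view, model_base[obj]))

def error_rate(labeled_views, model_base, similarity):
    """labeled_views: list of (view, true_object_id) pairs."""
    wrong = sum(1 for view, obj in labeled_views
                if classify(view, model_base, similarity) != obj)
    return wrong / len(labeled_views)

# Chance level for a 30-way decision among the test objects.
n_test_objects = 30
chance_error = 1 - 1 / n_test_objects
print(round(100 * chance_error, 1))   # 96.7

# Gap between eigenspace and view similarity, in percentage points.
print(round(26.1 - 20.3, 1))          # 5.8

# Toy usage with scalar "views" and negative distance as similarity.
toy_base = {"a": 0.0, "b": 1.0}
toy_views = [(0.2, "a"), (0.9, "b"), (0.6, "a")]
toy_rate = error_rate(toy_views, toy_base, lambda b, t: -abs(b - t))
```

Replacing the similarity argument with Euclidean distance, an eigenspace distance, or a learned view similarity function yields the three rows of the table under the same classification rule.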

7 Discussion 

This paper has introduced the notions of view generalization and
view similarity functions. We have seen that knowledge of these
functions allows an observer to recognize novel objects from a
set of training views in a Bayes-optimal (minimum classification
error) way.

³ Of course, even better performance can be achieved by
hardcoding additional prior knowledge about shape and object
similarity into the recognition method (e.g., pQ). Achieving
competitive performance with such methods would either require
encoding additional prior knowledge about shape similarity in
the numerical model of the view similarity function, or simply
using a much larger training set to allow the observer to learn
those regularities directly.

By expressing the eigenspace and linear combination of views
methods in the framework of view generalization functions, the
paper has demonstrated that fast and compact view generalization
functions exist that are at least as good as commonly used
methods for object recognition. Furthermore, the paper has given
a procedure for constructing the Bayes-optimal blurring for
matching, a Bayesian version of the geometric blur method in
[2], and shown such blurring methods to be first-order
approximations to the view generalization function.

The paper also reported experiments on the recognition of
simulated 3D paperclips, as well as the recognition of real
objects from the COIL-100 image database. In the case of
paperclips, a set of 200 training objects sufficed to reduce the
error rate on single-view generalization several-fold compared
to 2D view similarity. In the case of the COIL-100 database, the
use of view similarity cut the recognition error rate in half
compared to image-based similarity (from 40.0% to 20.3%). This
is also one of the first demonstrations of learning single-view
3D generalization for novel objects without requiring membership
in a special object class.

Both the theoretical arguments and the experiments presented in
this paper were designed only to show that view generalization
approaches are feasible. We would have expected learning of view
generalization functions to require a large number of training
objects. But the experimental results surpassed expectations:
view generalization and view similarity functions that
generalize significantly to arbitrary previously unseen objects
(and actually outperform eigenspace methods) are learnable from
very modest numbers of training objects (70 and 200).

Future work has to address a number of practical 
and engineering issues. 

The experiments in this paper demonstrated single-view
generalization. This was perhaps the more interesting case to
address first, since few other methods for 3D object recognition
are even capable of performing meaningful 3D generalization from
a single view of an unknown 3D object. The extension of this to
multi-view generalization requires some additional tricks; in
particular, instead of learning
$P(S = 1 \mid B, T_{\omega,1}, \ldots, T_{\omega,r})$, it turns
out to be desirable to learn
$P(S = 1 \mid B, f(T_{\omega,1}, \ldots, T_{\omega,r}))$ for a
function $f$ that "summarizes" the views in a way that makes it
easier to learn the view similarity function.



The statistical models used in the experiments in this paper
(empirical distributions and multilayer perceptrons)
incorporated no prior knowledge about objects or shape
similarity. Work on appearance-based 3D object recognition under
2D transformations (e.g., PQ, among many others) shows that
systems based on hardcoding knowledge about transformations and
shape similarity into view similarity measures can by themselves
achieve a significant ability to generalize across different 3D
views. Such techniques can be combined with the adaptive view
generalization approaches presented in this paper. If such
hybrid systems are constructed carefully, they will perform no
worse than the underlying systems using hardcoded similarity
measures, but have the potential to improve their performance
adaptively. Demonstrating this also remains for a future paper.



And while it is interesting that view similarity and view
generalization methods can already learn some generalization
from as few as 70 training objects, training on much larger
datasets is clearly desirable. After all, we are trying to
approximate a similarity measure that performs Bayes-optimal
recognition over the entire distribution of possible 3D shapes.
Fortunately, it is easy to generate large amounts of training
data without manual labeling from video sequences, by taking
advantage of the fact that video is often composed of scenes
within which individual objects undergo motion relative to the
camera; frames from such scenes provide training samples for
$P(S = 1 \mid B, T_\omega)$, while frames from different scenes
can be used as training samples for $P(S = 0 \mid B, T_\omega)$.
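The video-based labeling scheme can be sketched as follows. This is an illustration under assumptions: scene boundaries are taken as given (a list of frame lists), and the function name `pairs_from_scenes` is hypothetical. Labels obtained this way are only approximate, since two frames of one scene need not show the same object and different scenes may show the same one.

```python
import random

def pairs_from_scenes(scenes, n_pairs, rng):
    """scenes: list of scenes, each a list of frames.
    Returns (B, T, S) triples: S = 1 for two frames of the same scene,
    S = 0 for frames taken from two different scenes."""
    nonempty = [s for s in scenes if s]
    usable = [s for s in nonempty if len(s) >= 2]
    if len(nonempty) < 2 or not usable:
        raise ValueError("need two non-empty scenes, one with two or more frames")
    examples = []
    for _ in range(n_pairs):
        scene = rng.choice(usable)
        t, b = rng.sample(scene, 2)                # two distinct frames, same scene
        other = rng.choice([s for s in nonempty if s is not scene])
        b_other = rng.choice(other)                # frame from a different scene
        examples.append((b, t, 1))
        examples.append((b_other, t, 0))
    return examples

# Toy usage: frames are strings whose first letter names the scene.
toy_scenes = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2"]]
toy_examples = pairs_from_scenes(toy_scenes, 50, random.Random(1))
```

The output has the same (B, T, S) form as the training set used for the paperclip experiments, so the same classifier training procedure would apply without change.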